Labelled network subgraphs reveal stylistic subtleties in written texts
Abstract
The vast amount of available data and the increase in computational capacity have allowed texts to be analyzed from several perspectives, including their representation as complex networks. Nodes of the network represent words, and edges represent some relationship between them, usually word co-occurrence. Even though networked representations have been applied to several tasks, such approaches are not usually combined with traditional models relying upon statistical paradigms. Because networked models are able to grasp textual patterns, we devised a hybrid classifier, called labelled motifs, that combines the frequency of common words with small structures found in the topology of the network, known as motifs. Our approach is illustrated in two contexts, authorship attribution and translationese identification. In the former, a set of novels written by different authors is analyzed. To identify translationese, texts from the Canadian Hansard and the European parliament were classified as either original or translated. Our results suggest that labelled motifs are able to represent texts and that they should be further explored in other tasks, such as the analysis of text complexity, language proficiency, and machine translation.
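The idea of combining common words with small network structures can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a directed network linking adjacent words, uses three-node directed paths as the only motif type, and keeps the word itself as a node label only when it belongs to a caller-supplied set of common words (all other nodes are labelled `*`). The function names and the choice of motif are hypothetical.

```python
from collections import Counter


def cooccurrence_edges(tokens):
    """Directed edges between adjacent, distinct words (assumed network model)."""
    return {(a, b) for a, b in zip(tokens, tokens[1:]) if a != b}


def labelled_path_motifs(tokens, common_words):
    """Count three-node directed paths a -> b -> c, labelled by common words.

    Nodes that are not in `common_words` are anonymised as "*".
    Paths with no common word at all are ignored, since they carry no label.
    """
    edges = cooccurrence_edges(tokens)

    # Adjacency map: word -> set of words that follow it somewhere in the text.
    successors = {}
    for a, b in edges:
        successors.setdefault(a, set()).add(b)

    counts = Counter()
    for a, b in edges:
        for c in successors.get(b, ()):
            if c == a:  # require three distinct nodes
                continue
            label = tuple(w if w in common_words else "*" for w in (a, b, c))
            if label != ("*", "*", "*"):
                counts[label] += 1
    return counts


# Toy usage: the resulting counts could serve as features for a classifier.
example = labelled_path_motifs(
    "the cat saw the dog and the cat ran".split(),
    common_words={"the", "and"},
)
```

Each key of `example` is one labelled motif, and its count could be normalised and fed to any standard classifier as a stylistic feature. A fuller treatment would cover more motif types than simple paths.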
Similar resources
Authorship recognition via fluctuation analysis of network topology and word intermittency
Statistical methods have been widely employed in many practical natural language processing applications. More specifically, complex networks concepts and methods from dynamical systems theory have been successfully applied to recognize stylistic patterns in written texts. Despite the large number of studies devoted to representing texts with physical models, only a few studies have assessed the r...
Stylistic Changes for Temporal Text Classification
This paper investigates stylistic changes in a set of Portuguese historical texts ranging from the 17th to the early 20th century and presents a supervised method to classify them per century. Four stylistic features, namely average sentence length (ASL), average word length (AWL), lexical density (LD), and lexical richness (LR), were automatically extracted for each sub-corpus. The initial analysis of ...
Elimination of the Elements of the Sentense in Sahife-ye-Shahi Book
Language always tends toward brevity, that is, it tries to convey its intent using the fewest possible words. One consequence of this process is the deletion of sentence components. Poets and writers sometimes omitted some of the components of the word in order to summarize the word and, of course, to observe the principles of rhetoric, punctilios and syntactic ...
Ordinal measures in authorship identification
The goal of this paper is to compare a set of distance/similarity measures with regard to their ability to reflect stylistic similarity between authors and texts. To assess the ability of these distance/similarity functions to capture stylistic similarity between texts, we tested them in one of the most frequently employed multivariate statistical analysis settings: cluster analysis. The experimen...
Ensuring Stylistic Congruity in Collaboratively Written Text: Requirements Analysis and Design Issues
Often, texts that have been written collaboratively do not "speak with a single voice." Eliminating stylistic incongruity, a difficult undertaking for both collaborative and singular writers, is the desired function of a software tool. This thesis describes the first cycle of an iterative software development process towards meeting this goal. The user requirements are analyzed with respect to a mo...